## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
This is the training portion of the USA Housing dataset, downloaded from the Kaggle website (https://www.kaggle.com/gpandi007/usa-housing-dataset). It contains sale prices for houses in the USA.
## [1] 1460 81
The data contained 79 attributes (other than id and sale price) for 1460 houses.
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 RH : 16 Median : 69.00
## Mean : 730.5 Mean : 56.9 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape LandContour
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle OverallQual
## Norm :1445 1Fam :1220 1Story :726 Min. : 1.000
## Feedr : 6 2fmCon: 31 2Story :445 1st Qu.: 5.000
## Artery : 2 Duplex: 52 1.5Fin :154 Median : 6.000
## PosN : 2 Twnhs : 43 SLvl : 65 Mean : 6.099
## RRNn : 2 TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000
## PosA : 1 1.5Unf : 14 Max. :10.000
## (Other): 2 (Other): 19
## OverallCond YearBuilt YearRemodAdd RoofStyle
## Min. :1.000 Min. :1872 Min. :1950 Flat : 13
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 Gable :1141
## Median :5.000 Median :1973 Median :1994 Gambrel: 11
## Mean :5.575 Mean :1971 Mean :1985 Hip : 286
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 Mansard: 7
## Max. :9.000 Max. :2010 Max. :2010 Shed : 2
##
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea
## CompShg:1434 VinylSd:515 VinylSd:504 BrkCmn : 15 Min. : 0.0
## Tar&Grv: 11 HdBoard:222 MetalSd:214 BrkFace:445 1st Qu.: 0.0
## WdShngl: 6 MetalSd:220 HdBoard:207 None :864 Median : 0.0
## WdShake: 5 Wd Sdng:206 Wd Sdng:197 Stone :128 Mean : 103.7
## ClyTile: 1 Plywood:108 Plywood:142 NA's : 8 3rd Qu.: 166.0
## Membran: 1 CemntBd: 61 CmentBd: 60 Max. :1600.0
## (Other): 2 (Other):128 (Other):136 NA's :8
## ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## Ex: 52 Ex: 3 BrkTil:146 Ex :121 Fa : 45 Av :221
## Fa: 14 Fa: 28 CBlock:634 Fa : 35 Gd : 65 Gd :134
## Gd:488 Gd: 146 PConc :647 Gd :618 Po : 2 Mn :114
## TA:906 Po: 1 Slab : 24 TA :649 TA :1311 No :953
## TA:1282 Stone : 6 NA's: 37 NA's: 37 NA's: 38
## Wood : 3
##
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2
## ALQ :220 Min. : 0.0 ALQ : 19 Min. : 0.00
## BLQ :148 1st Qu.: 0.0 BLQ : 33 1st Qu.: 0.00
## GLQ :418 Median : 383.5 GLQ : 14 Median : 0.00
## LwQ : 74 Mean : 443.6 LwQ : 46 Mean : 46.55
## Rec :133 3rd Qu.: 712.2 Rec : 54 3rd Qu.: 0.00
## Unf :430 Max. :5644.0 Unf :1256 Max. :1474.00
## NA's: 37 NA's: 38
## BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir
## Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741 N: 95
## 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49 Y:1365
## Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :2336.0 Max. :6110.0 Wall : 4
##
## Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39
## Median :0.0000 Median :3.000 Median :1.000 Gd:586
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :2.0000 Max. :8.000 Max. :3.000
##
## TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType
## Min. : 2.000 Maj1: 14 Min. :0.000 Ex : 24 2Types : 6
## 1st Qu.: 5.000 Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870
## Median : 6.000 Min1: 31 Median :1.000 Gd :380 Basment: 19
## Mean : 6.518 Min2: 34 Mean :0.613 Po : 20 BuiltIn: 88
## 3rd Qu.: 7.000 Mod : 15 3rd Qu.:1.000 TA :313 CarPort: 9
## Max. :14.000 Sev : 1 Max. :3.000 NA's:690 Detchd :387
## Typ :1360 NA's : 81
## GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## Min. :1900 Fin :352 Min. :0.000 Min. : 0.0 Ex : 3
## 1st Qu.:1961 RFn :422 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48
## Median :1980 Unf :605 Median :2.000 Median : 480.0 Gd : 14
## Mean :1979 NA's: 81 Mean :1.767 Mean : 473.0 Po : 3
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311
## Max. :2010 Max. :4.000 Max. :1418.0 NA's: 81
## NA's :81
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch
## Ex : 2 N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00
## Fa : 35 P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Gd : 9 Y:1340 Median : 0.00 Median : 25.00 Median : 0.00
## Po : 7 Mean : 94.24 Mean : 46.66 Mean : 21.95
## TA :1326 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00
## NA's: 81 Max. :857.00 Max. :547.00 Max. :552.00
##
## X3SsnPorch ScreenPorch PoolArea PoolQC
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Ex : 2
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2
## Median : 0.00 Median : 0.00 Median : 0.000 Gd : 3
## Mean : 3.41 Mean : 15.06 Mean : 2.759 NA's:1453
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :508.00 Max. :480.00 Max. :738.000
##
## Fence MiscFeature MiscVal MoSold
## GdPrv: 59 Gar2: 2 Min. : 0.00 Min. : 1.000
## GdWo : 54 Othr: 2 1st Qu.: 0.00 1st Qu.: 5.000
## MnPrv: 157 Shed: 49 Median : 0.00 Median : 6.000
## MnWw : 11 TenC: 1 Mean : 43.49 Mean : 6.322
## NA's :1179 NA's:1406 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :15500.00 Max. :12.000
##
## YrSold SaleType SaleCondition SalePrice
## Min. :2006 WD :1267 Abnorml: 101 Min. : 34900
## 1st Qu.:2007 New : 122 AdjLand: 4 1st Qu.:129975
## Median :2008 COD : 43 Alloca : 12 Median :163000
## Mean :2008 ConLD : 9 Family : 20 Mean :180921
## 3rd Qu.:2009 ConLI : 5 Normal :1198 3rd Qu.:214000
## Max. :2010 ConLw : 5 Partial: 125 Max. :755000
## (Other): 9
A look at the distribution of Sale Prices.
The distribution is right skewed and it seemed appropriate to carry out a log transformation. The next plot looks at the distribution of the log transformed prices.
The log transformed distribution follows an almost normal distribution. Thus for further analyses, I decided to use the log-transformed sale prices.
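A minimal sketch of the transformation, assuming a base-10 log (the later sections refer to log10(SalePrice)). The vector `sale_price` and the helper `skewness()` are hypothetical; the values are a handful of prices taken from the output above, not the full column.

```r
# Hypothetical stand-in for the SalePrice column: a few values copied from
# the head() and summary() output above, not the full 1460 rows.
sale_price <- c(34900, 129975, 140000, 143000, 163000, 181500,
                208500, 214000, 223500, 250000, 755000)

# Base-10 log transform of the sale price.
log_sale_price <- log10(sale_price)

# Simple moment-based skewness: third central moment divided by the
# 1.5th power of the second central moment (helper name is an assumption).
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(sale_price)      # positive for a right-skewed sample
skew_log <- skewness(log_sale_price)  # closer to zero after the transform
```

Comparing `skew_raw` with `skew_log` gives a numeric check to go alongside the histograms.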
MSSubClass defines the type of dwelling and MSZoning identifies the general zoning in the sale.
The classes for MSSubClass are available in the annex. MSSubClass is a combination of YearBuilt, HouseStyle and BldgType. Since it repeats information held in those variables, I will be ignoring it in further analyses.
The MSZoning plot shows that most of the houses sold were in the Residential Low Density zone (RL, 1151 of 1460). Thus, I won’t be considering this variable for further analyses either.
LotFrontage is the linear feet of street connected to property. LotArea is the size of the lot sold in square feet. LotShape describes the shape of the lot in terms of regularity. LotConfig gives the configuration of the lot sold.
The LotFrontage plot shows a right-tailed distribution, similar to the sale price distribution. This requires further analysis.
The LotArea plot shows a very long-tailed distribution but is hard to interpret in this form. I decided to create a new variable that categorises the lot area in the next section.
As expected, most of the houses had a regular lot shape. Just to explore the effect of the shape, I will be considering this for further analyses.
Most of the properties sold were on an inside lot. Again, to explore the effect of the lot configuration, I will be considering this for further analyses.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
This plot looks more interesting and there could be a relationship of sale price with this variable.
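One possible way to build such a categorised variable is `cut()`. The break points and labels below are assumptions chosen for illustration (the original bin edges are not shown here), and `lot_area` is a small hypothetical sample taken from the output above.

```r
# Hypothetical sample of LotArea values (square feet) from the output above.
lot_area <- c(1300, 7554, 8450, 9478, 9600, 11250, 11602, 14260, 215245)

# Assumed bin edges and labels; the edges used in the analysis are not shown.
breaks <- c(0, 5000, 10000, 15000, 20000, Inf)
labels <- c("<5000", "5000-10000", "10000-15000", "15000-20000", ">20000")

# right = FALSE makes each bin closed on the left: [lower, upper).
# ordered_result = TRUE keeps the natural order of the categories.
lot_area_cat <- cut(lot_area, breaks = breaks, labels = labels,
                    right = FALSE, ordered_result = TRUE)

# Count of houses per category.
table(lot_area_cat)
```

Because the result is an ordered factor, `as.integer(lot_area_cat)` gives category ranks that can be used in a correlation matrix.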
Street gives type of road access and Alley describes the type of alley access to the property.
As most houses have a Paved Street and don’t have an alley, these variables will not be considered for further analyses.
LandContour describes the flatness of the property and LandSlope describes the slope.
Most of the houses were on near flat or level land and most properties had a gentle slope. So I will not be considering these for further analyses.
Utilities describe the types available and Conditions describe proximity to various conditions.
All these variables will be ignored from further analyses as there is not any variation in the data. Further description of the different conditions can be found in the annex.
Neighborhood describes the physical locations within Ames city limits.
It will be interesting to see the effect of the different neighborhoods in terms of age and sale price. I hypothesize that the newer neighborhoods would have a higher sale price in comparison to the older neighborhoods.
BldgType describes the type of dwelling while HouseStyle describes the style.
Most of the houses in this dataset were single family homes. Also to note there were two different categories of townhouses. Although there is not much variation, it will be good to check the effect of the building type on sale price.
The houses were mostly single storey, followed by two storeys. I will be ignoring this variable in further analyses.
OverallQual rates the overall material and finish of the house and OverallCond rates the overall condition.
Most of the houses had an Overall Quality of 5 and 50% of the houses were in the range of 5 to 7. While for Overall Condition, most houses were 5. I will be looking at the effect of both these variables with SalePrice.
YearBuilt is the year the house was built and YearRemodAdd is the year any remodelling or additions happened. If no renovations were done the year would be same for both the variables.
One would expect the older houses to cost less but since the data looks mainly at 2000-2005 built homes and the older homes are likely to have been remodelled, this variable may not be of much significance.
If the houses that were built in the early years were remodelled, that might need to be factored in.
These variables look at the roof style & material.
## [1] "Roof Material"
## ClyTile CompShg Membran Metal Roll Tar&Grv WdShake WdShngl
## 1 1434 1 1 1 11 5 6
## [1] "Roof Style"
## Flat Gable Gambrel Hip Mansard Shed
## 13 1141 11 286 7 2
I will be ignoring these variables as there is not much variation in the results.
These variables categorise the exterior coverings of the houses sold.
These plots show that the exterior covering is mainly vinyl siding. As there is some variation in the values, I will be checking its effect on sale price.
These variables describe the masonry veneer type and area.
Since most of the houses had no masonry veneer, I shall be ignoring both these variables.
The quality and condition of the exterior for the houses are described in these variables.
Although the variation is limited, I would like to see their relationship with sale price and overall quality and condition.
Foundation describes the type of foundation of the houses sold.
I wonder if any of these foundations increased the sale price significantly?
BsmtQual and BsmtCond describe the quality and condition of the basement (if available) for the houses sold.
I shall keep these variables to see if there’s any effect on sale price.
BsmtExposure refers to walkout or garden level walls.
I will be ignoring this variable as there is not much variation.
Ratings and square footage of Basement Finished Area are described below.
The unfinished square footage of basement.
I will be ignoring the square footage as there is a variable with total square footage of the basement which I would like to consider.
I wonder if having living quarters or a rec room in the basement increases the sale price. Therefore I am converting the Finished Basement Types to a new variable called FinBsmt. To check whether having living quarters rather than a rec room made a difference to the price, I also added a new variable, LivOrRec.
I think it would be nice to see the effect of having a living quarters or recreation room on the price.
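A sketch of how the two derived variables might be built from BsmtFinType1 and BsmtFinType2. The category labels ("Finished", "Both", and so on), the grouping of GLQ/ALQ/BLQ as living quarters, and the treatment of LwQ as finished are assumptions, not the definitions used in the analysis; the data frame `bsmt` is hypothetical.

```r
# Hypothetical rows; NA means the house has no basement.
bsmt <- data.frame(
  BsmtFinType1 = c("GLQ", "ALQ", "Unf", "Rec", NA, "Rec"),
  BsmtFinType2 = c("Unf", "Rec", "Unf", "Unf", NA, "LwQ"),
  stringsAsFactors = FALSE
)

# Assumed grouping of the rating codes.
living_codes   <- c("GLQ", "ALQ", "BLQ")          # living quarters of any quality
finished_codes <- c(living_codes, "Rec", "LwQ")   # anything other than Unf / NA

# %in% returns FALSE for NA, so houses without a basement fall through cleanly.
has_living <- bsmt$BsmtFinType1 %in% living_codes |
  bsmt$BsmtFinType2 %in% living_codes
has_rec <- bsmt$BsmtFinType1 %in% "Rec" | bsmt$BsmtFinType2 %in% "Rec"
is_finished <- bsmt$BsmtFinType1 %in% finished_codes |
  bsmt$BsmtFinType2 %in% finished_codes

# FinBsmt: is any part of the basement finished?
bsmt$FinBsmt <- ifelse(is_finished, "Finished", "Unfinished/None")

# LivOrRec: living quarters, rec room, both, or neither.
bsmt$LivOrRec <- ifelse(has_living & has_rec, "Both",
                 ifelse(has_living, "Living",
                 ifelse(has_rec, "Rec", "Neither")))
```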
These variables describe the type of heating and its quality and condition.
Although there is not much variation in the type of heating, I shall keep it to see if there’s a change in type dependent on the Year Remodelled. HeatingQC might have an effect on the price of the houses especially if they are sold in the winter months.
CentralAir tells whether the house has central air conditioning. Electrical describes the type of electrical system installed.
I wonder about the effect of central air conditioning on price, and whether the electrical system depends on the year remodelled.
These variables give the square footage of the first and second floors.
LowQualFinSF gives the Low quality finished square feet (all floors).
The above plot does not show much information as most houses had 0 sq ft that was low quality finished. Thus, in the next plot I subsetted the data to those with more than 0 sq ft and looked at the distribution in the form of a box plot.
Since the next variable, GrLivArea is the sum of the above variables (X1stFlrSF, X2ndFlrSF and LowQualFinSF), I will not be considering these variables for the rest of the analyses.
Since square footage labelled as “GrLivArea” is probably one of the important factors for the pricing, I plotted its distribution.
Looking at the distribution, I decided to categorize the variable by binning them in to categories and create a new variable. To understand the distribution, I looked at the summary of the data and then categorised them.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
Looking at the summary, I decided to bin the square footage in to 5 categories:
These variables give the number of full baths, in the basement and above grade.
These variables give the number of half baths, in the basement and above grade.
I would like to see the effect of the number of baths (full / half) on price but with a new variable “TotalBaths”.
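A minimal sketch of one way to compute TotalBaths, assuming it is a plain count of all four bath columns (half baths counted as one each, not weighted by 0.5). The data frame `baths` holds the first six rows from the output above.

```r
# First six rows of the four bath columns, copied from the head() output.
baths <- data.frame(
  FullBath     = c(2, 2, 2, 1, 2, 1),
  HalfBath     = c(1, 0, 1, 0, 1, 1),
  BsmtFullBath = c(1, 0, 1, 1, 1, 1),
  BsmtHalfBath = c(0, 1, 0, 0, 0, 0)
)

# Assumption: every bath counts as one, whether full or half,
# above grade or in the basement.
baths$TotalBaths <- with(baths,
                         FullBath + HalfBath + BsmtFullBath + BsmtHalfBath)
```

A weighted version (half baths counted as 0.5) would be an equally reasonable choice; it only changes the weights in the sum.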
BedroomAbvGr and TotRmsAbvGrd describe number of bedrooms above grade (does NOT include basement bedrooms) & total rooms above grade (does not include bathrooms).
Just to see the effect of the number of bedrooms and total rooms above grade on sale price I shall take it to the next analyses.
These variables describe the number of kitchens above grade and quality of kitchen.
Does having more than one kitchen or quality of kitchen have an effect on price?
Functional describes the level of Functionality of the house.
I shall ignore this variable as most houses have typical functionality.
These variables describe the number of fireplaces and their quality.
Similar to heating, I wonder if having a fireplace increased the sale price in winter months. I shall be ignoring the fireplace quality.
This variable describes the type of garage.
Most of the Garage Type was Attached (Attchd) followed by Detached (Detchd). Will the type of garage affect sale price?
These variables describe the year the garage was built and its finish.
It will be interesting to compare the year the garage was built with the year the house was remodelled and the finish of the garage with the sale price.
The number of cars in the garage and the area of the garage are described by GarageCars and GarageArea respectively.
The number of cars in the garage and the area of the garage could be deciding factors for the sale of a house, so I shall be exploring these further.
The plots for these two variables reveal mostly “Typical/Average” and therefore I am not considering them for further analyses.
Since the variation in these variables is minimal, they will not be considered for further analyses.
Since the porches are not part of all the houses, I have plotted only houses that have square footage greater than 0.
The above features will not be considered for further analyses.
## [1] "Number of houses with pools = 7"
Since only 7 houses have a pool, I am not considering these variables.
## [1] "Percentage of houses with fence= 19.2465753424658"
Less than 20% of the houses have a fence, thus not considering this feature for further analyses.
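The percentage above can be reproduced from the summary counts: Fence has 1179 missing values out of 1460 rows. The sketch below assumes that a missing Fence value means the house has no fence; the vector `fence` is a hypothetical sample holding the first six rows.

```r
# Counts taken from the summary() output above.
n_houses   <- 1460
n_no_fence <- 1179   # NA's in the Fence column

# Share of houses that have a fence, in percent.
pct_from_counts <- 100 * (n_houses - n_no_fence) / n_houses

# The same calculation on a column: the mean of a logical vector is the
# proportion of TRUE values. These are the first six rows of Fence.
fence <- c(NA, NA, NA, NA, NA, "MnPrv")
pct_in_sample <- 100 * mean(!is.na(fence))

# Rounded message, easier to read than the full-precision output.
message(sprintf("Percentage of houses with fence = %.1f", pct_from_counts))
```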
Since miscellaneous features are not present in most houses sold, only houses with MiscVal > 0 and a recorded MiscFeature were considered.
The above variables will not be considered for further analyses.
Most sales seem to have happened during the summer months of June and July as expected.
The data is incomplete for 2010. Since the number of houses sold each year between 2006 and 2009 was similar, this variable should account for any fluctuation in the real estate market and will be interesting to explore.
I note that this data may not be real: the real estate market is known to have crashed in 2008 and did not really recover before 2010. This does not show up in the graphs for the number of houses sold during that period. I wonder if it will be obvious when compared to the sale prices during that time.
This plot clearly shows that the 2010 data only runs up to the beginning of July. Another insight is that June and July are the most popular months for buying. I wonder if the sale prices had any trends during the year.
This variable doesn’t have much variation and thus will be ignored.
The Sale Condition as expected was Normal for most of the houses and thus shall be ignored.
There are 1,460 houses in the dataset with 81 features. I focused on 39 features (5 of which were derived). The OverallQual and OverallCond variables can be considered ordered factor variables, with 1 being the worst and 10 being the best. Other observations:

* The median sale price is $163,000 and the maximum price is $755,000. Looking at the distribution, I log-transformed the sale price.
* The lot area variable was categorized and most houses had a lot area between 5000 and 10000 square feet. The distribution of lot frontage was right tailed. Some lots were irregular in shape, and most of the properties were inside lots.
* The North Ames neighbourhood had the highest number of sales during the period.
* Most of the houses sold were single family homes. Townhomes were either end or inside units.
* Most of the houses had an Overall Quality of 5 and 50% of the houses were in the range of 5 to 7. The Overall Condition of the houses had a peak at 5.

Most sales seem to have happened during the summer months of June and July. The data is incomplete for 2010. I also note that this data may not be real: the real estate market is known to have crashed in 2008 and did not really recover before 2010. This does not show up in the graphs for the number of houses sold during that period. I wonder if it will be obvious when compared to the sale prices during that time.

The data looks mainly at houses built from 2000 to 2005. The square footage of the houses sold was binned into 5 different categories. Most of the houses sold were in the 1500 to 2000 square feet range. The number of cars that the garage could accommodate varied from 0 to 4, and the area of the garage from 0 to almost 1500 square feet. I also note that the garages might have been built at a later date and may not be completely finished. There is not much variation in the garage type, but whether it was attached or detached could affect sale prices.
Interestingly, some houses had more than one kitchen and their qualities varied. The number of bedrooms above grade ranged from 0 to 8, while the total number of rooms above grade varied from 2 to 14. A new variable with the total number of bathrooms was computed and ranged from 1 to 6. Most houses had central air conditioning, but it would be interesting to see its effect on sale price. The electrical system and the type of heating varied, probably based on the year remodelled. Since the heating quality varied, I wonder if it affects the sale price, especially if sold in the winter months. Two new variables were compiled from the basement variables. FinBsmt records whether the basement was finished or not, and LivOrRec records whether the basement is living quarters, a recreation room or both. The total basement square footage ranged from 0 to about 6000 square feet. There was some variability in basement quality and condition. The two most common foundations of the houses sold were poured concrete and cinder block. Exterior Quality and Condition were mostly typical / average. Most houses had vinyl siding, but others had wood siding, cement board, metal siding or plywood.
The main features in the data set are square footage and sale price. I wonder which other features will contribute to predicting the sale price of a house. I believe a combination of the features can be used to build a predictive model for the sale price of houses.
I think square footage, month sold, year sold, neighborhood, year remodelled, overall quality, overall condition, garage type and building type will help determine the sale price of the house sold.
I created a few new variables from existing variables: 1) TotalBaths - the total number of baths; 2) LotAreaCat - categorised lot area; 3) LivAreaCat - categorised square footage; 4) FinBsmt - basement finish; 5) LivOrRec - living quarters or recreation room in the basement.
I found the sale price had a skewed distribution, and to tidy it I carried out a log transformation. The transformed sale price follows an approximately normal distribution. I also subsetted the data to include only the features of interest and the unique identifier for each house sold.
In an effort to look at the effect of the variables of interest (those with an inherent logical order) on Sale Price, I made 3 matrix plots as below. Only correlations greater than 0.6 were considered further.
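The numbers behind such a matrix plot can be sketched with `cor()`. This assumes Pearson correlation and that ordered quality ratings are converted to integer ranks first; the data frame `houses` is a hypothetical six-row sample from the output above, so its correlations will not match the values reported for the full data.

```r
# Six rows copied from the head() output; far too few for real conclusions.
houses <- data.frame(
  OverallQual = c(7, 6, 7, 7, 8, 5),
  GrLivArea   = c(1710, 1262, 1786, 1717, 2198, 1362),
  KitchenQual = c("Gd", "TA", "Gd", "Gd", "Gd", "TA"),
  SalePrice   = c(208500, 181500, 223500, 140000, 250000, 143000)
)

# Map the quality ratings to integer ranks (Po = 1 ... Ex = 5).
quality_levels <- c("Po", "Fa", "TA", "Gd", "Ex")
houses$KitchenQual <- as.integer(
  factor(houses$KitchenQual, levels = quality_levels, ordered = TRUE)
)

# Work on the log10 scale for the price, then drop the raw column.
houses$LogSalePrice <- log10(houses$SalePrice)
houses$SalePrice <- NULL

# Pearson correlation matrix of all remaining (numeric) columns.
cor_mat <- cor(houses, method = "pearson")

# Features whose absolute correlation with log price exceeds the 0.6 cut-off.
with_price <- cor_mat[, "LogSalePrice"]
strong <- names(with_price)[abs(with_price) > 0.6 &
                              names(with_price) != "LogSalePrice"]
```

Treating ordered categories as equally spaced integers is a simplification; a rank-based method such as `method = "spearman"` avoids that assumption.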
In this plot the most interesting correlations were between log10(SalePrice) and other features such as OverallQual (0.82), TotalBaths (0.67), LotAreaCat (0.69) and LivAreaCat (0.69). LivAreaCat and LotAreaCat had a correlation of 1, but this is not surprising. I had expected some correlation between OverallQual and OverallCond, but this was not the case. Similarly, the correlations of SalePrice with YearBuilt and YearRemodAdd were only 0.59 and 0.57 respectively, with a correlation of 0.59 between the two.
In this plot, ExterQual, BsmtQual and TotBsmtSqft have correlations of 0.68, 0.65 and 0.61 respectively with log10(SalePrice). Also to note, there was a correlation of 0.64 between ExterQual and BsmtQual. Interestingly, the correlation between TotRmsAbvGrd and log10(SalePrice) was negligible.
In this plot, KitchenQual, GarageCars and GarageArea have correlations of 0.67, 0.68 and 0.65 respectively with log10(SalePrice). Also to note, there is an expected high correlation of 0.88 between GarageCars and GarageArea. Interestingly, TotRmsAbvGrade has a correlation of only 0.53 with log10(SalePrice). It also seems that the month and year of the sale have no effect on price. GarageYrBlt seems to have had some (though minimal) influence on GarageFinish (0.53), GarageCars (0.59) and GarageArea (0.56).
Since most of the quality features had a high correlation with SalePrice, I thought it was important to look at the correlations between these variables.
This plot shows that all the quality variables are correlated with a correlation coefficient > 0.65. This makes me wonder if OverallQual explains most of the variation in SalePrice.
Similarly, I decided to look at the features that look at the area for the houses sold.
This plot shows that the ‘area’ features are not as correlated as the ‘quality’ features, except for LotAreaCat and LivAreaCat, which we had noted before.
The last one I wanted to look at was the correlation between TotalBaths, GarageCars, TotRmsAbvGrade, LivAreaCat.
I had expected a higher correlation than 0.46 between TotalBaths and TotRmsAbvGrade. There is a high correlation between LivAreaCat and TotRmsAbvGrade, as expected.
For the other categorical variables I looked at boxplots ordered on median sale price for each category of each variable.
In the above boxplots, each feature was sorted for median SalePrice of each category of the feature and plotted. The median SalePrice was also plotted as a red dashed line.
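One way to get this ordering in base R is `reorder()`, which sorts the factor levels by a summary statistic. The sketch below assumes base graphics (the plotting library used for the original figures is not shown) and uses a hypothetical six-row sample from the output above.

```r
# Six rows copied from the head() output.
houses <- data.frame(
  Neighborhood = c("CollgCr", "Veenker", "CollgCr", "Crawfor",
                   "NoRidge", "Mitchel"),
  SalePrice    = c(208500, 181500, 223500, 140000, 250000, 143000)
)
houses$LogSalePrice <- log10(houses$SalePrice)

# Re-level the factor so categories run from lowest to highest median price.
houses$Neighborhood <- reorder(houses$Neighborhood, houses$LogSalePrice,
                               FUN = median)

# Boxplot per category, in the new level order.
boxplot(LogSalePrice ~ Neighborhood, data = houses, las = 2,
        xlab = "", ylab = "log10(SalePrice)")

# Overall median sale price as a red dashed reference line.
abline(h = median(houses$LogSalePrice), col = "red", lty = 2)
```

The same two steps (reorder, then plot) can be wrapped in a loop over the categorical variables of interest.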
The features LivOrRec, LotShape, LotConfig, BldgType, Exterior1st, Exterior2nd, Foundation, Heating and Electrical seem to have no significant effect on SalePrice. Thus I won’t be considering these further.
As expected, Neighborhood has an effect on SalePrice. Also to note, houses with Builtin & Attached Garages had higher SalePrice compared to the rest.
I decided to extract the coefficients of a linear model of LogSalePrice ~ OverallQual and plot it again.
##
## Call:
## lm(formula = LogSalePrice ~ OverallQual, data = housing_interest)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.46396 -0.05634 0.00569 0.05790 0.40145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.596765 0.011842 388.18 <2e-16 ***
## OverallQual 0.102506 0.001893 54.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1 on 1458 degrees of freedom
## Multiple R-squared: 0.6678, Adjusted R-squared: 0.6676
## F-statistic: 2931 on 1 and 1458 DF, p-value: < 2.2e-16
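To read these coefficients on the original price scale, the fitted line can be back-transformed. This is a small worked sketch, assuming LogSalePrice is log10(SalePrice); the numbers are copied from the model output above and `predict_price()` is a hypothetical helper.

```r
# Coefficients and R-squared copied from the summary above.
intercept <- 4.596765
slope     <- 0.102506
r_squared <- 0.6678

# On the log10 scale, one extra quality point adds `slope`, which multiplies
# the predicted price by 10^slope (roughly 1.27, i.e. about 27% more).
price_ratio_per_point <- 10^slope

# Predicted sale price in dollars for a given OverallQual (hypothetical helper).
predict_price <- function(overall_qual) {
  10^(intercept + slope * overall_qual)
}

# With a single predictor, sqrt(R-squared) is the absolute Pearson
# correlation, which should agree with the 0.82 reported elsewhere.
implied_r <- sqrt(r_squared)
```

For example, `predict_price(7) / predict_price(6)` returns the same ratio as `price_ratio_per_point`, whichever quality level is used as the baseline.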
The highest correlation with log10(sale price), 0.82, was with OverallQual. This was followed by LivAreaCat (0.69), LotAreaCat (0.69), ExterQual (0.68), GarageCars (0.68), KitchenQual (0.67), TotalBaths (0.67), BsmtQual (0.65), GarageArea (0.65) and TotBsmtSqft (0.61). It is important to note that these features are not all independent and some are correlated with each other; for example, LivAreaCat and LotAreaCat are completely correlated, with a correlation coefficient of 1. Neighborhood shows an upward trend with SalePrice. Also to note, houses with Builtin and Attached garages had a higher SalePrice compared to the rest.
There was no correlation between OverallCond and OverallQual. All the quality variables are correlated with a correlation coefficient > 0.65. This makes me wonder if OverallQual explains most of the variation in SalePrice. GarageYrBlt seems to have had some (though minimal) influence on GarageFinish (0.51), GarageCars (0.59) and GarageArea (0.56). The correlation between BedroomAbvGr and SalePrice was negligible. The month and year the houses were sold had no effect on the SalePrice.
The strongest relationship for SalePrice was with OverallQual of the houses.
In this section, I wanted to explore at least a few of the effects of the various features together and a few others.
This plot shows that the median SalePrice was not affected by the Year or Month of Sale. This is surprising as it is known that
To begin exploring the total living area and overall quality with sale price, I plotted these three features.
This plot clearly shows an increase in LogSalePrice with Overall Quality and Living Area Category. It also shows that the smaller the area of the house, the lower the quality tends to be.
I created the next plot in an effort to look at whether there was any effect of remodelling on overall quality, sale price for each neighborhood.
In this plot, it is obvious that the sale price / square foot was not affected by remodelling. The effect of the overall quality and sale price / square foot in each neighborhood can be clearly seen in this plot.
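The quantities behind this comparison could be derived as sketched below. It assumes that price per square foot means SalePrice divided by GrLivArea, and that a house counts as remodelled when YearRemodAdd differs from YearBuilt (following the variable description earlier); neither definition is confirmed by the output shown. The data frame `houses` is a hypothetical six-row sample.

```r
# Six rows copied from the head() output.
houses <- data.frame(
  Neighborhood = c("CollgCr", "Veenker", "CollgCr", "Crawfor",
                   "NoRidge", "Mitchel"),
  YearBuilt    = c(2003, 1976, 2001, 1915, 2000, 1993),
  YearRemodAdd = c(2003, 1976, 2002, 1970, 2000, 1995),
  GrLivArea    = c(1710, 1262, 1786, 1717, 2198, 1362),
  SalePrice    = c(208500, 181500, 223500, 140000, 250000, 143000)
)

# Assumption: the two years are equal when no remodelling took place.
houses$Remodelled <- houses$YearRemodAdd != houses$YearBuilt

# Assumption: price per square foot uses above-grade living area.
houses$PricePerSqFt <- houses$SalePrice / houses$GrLivArea

# Median price per square foot for each neighbourhood and remodel status.
median_by_group <- aggregate(PricePerSqFt ~ Neighborhood + Remodelled,
                             data = houses, FUN = median)
```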
In the next plot, I looked at the effect of Sale Price per Square foot of garage area, depending on garage type and coloured based on the garage finish.
In this plot, there is no correlation between Sale Price and the area, type or finish of the garage. However, there seems to be a pattern that most detached garages are unfinished.
In the next plot, I wanted to look at the relationships between Overall Quality to the other qualities.
The plot shows the correlations between the quality variables. The overall quality was higher as expected when all External, Kitchen and Basement Quality were towards Excellent.
In this plot I looked at the sale price in terms of overall quality, total baths and total rooms above grade. There is a clear increase in quality, total baths and rooms above grade with the sale price with exceptions.
The living area also determined quality and sale price. This was obvious in the first plot that looked at overall quality versus sale price for each living area category and in the last plot that considered the total baths and total rooms above grade.
The neighborhood had an effect on the sale price, but it didn’t matter whether the house was remodelled or not. The neighborhood Northridge Heights (NridgHt) had the highest sale price / square foot but also the best quality houses. At the other end was the South & West of Iowa State University (SWISU) neighborhood, with low quality houses at the lowest sale price / square foot.
The correlations between the quality variables could easily be seen in the second-to-last plot.
The garage variables did not seem to have much effect on the sale price.
The distribution of the sale price of the houses sold on log scale appears to be almost normal. The price ranges from $34,900 to $755,000.
The feature “OverallQual” had the highest Pearson correlation, 0.82, with Sale Price on the log scale. This plot shows the fitted line of a linear regression model between Overall Quality and Sale Price.
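A fitted line of this kind comes from a one-predictor regression, and for such a model the R-squared equals the square of the Pearson correlation. The sketch below illustrates that with base R's lm() on invented data. The data frame `toy`, its values and the use of the natural log are assumptions made for illustration.

```r
# Toy data: values are made up for illustration, not taken from the housing data.
toy <- data.frame(
  OverallQual = c(3, 4, 5, 6, 7, 8, 9),
  SalePrice   = c(80000, 95000, 130000, 160000, 205000, 275000, 370000)
)

# Assumption: natural log of the sale price.
toy$LogSalePrice <- log(toy$SalePrice)

# Simple linear regression of log price on overall quality.
fit <- lm(LogSalePrice ~ OverallQual, data = toy)

# Pearson correlation and the model's R-squared.
r <- cor(toy$OverallQual, toy$LogSalePrice)
r_squared <- summary(fit)$r.squared

# With a single predictor, R-squared is the squared correlation.
print(c(r = r, r_squared = r_squared, r_times_r = r^2))

# To draw the scatter plot with the fitted line:
# plot(toy$OverallQual, toy$LogSalePrice)
# abline(fit)
```

Applied to the reported correlation of 0.82, this relation implies that OverallQual alone would account for about 0.82^2 ≈ 0.67 of the variance in log sale price.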
This housing data set had 1460 houses and 81 features, including “Id”. I began by exploring all 80 features (excluding Id). My main feature of interest was Sale Price. The data set covers houses sold between 2006 and 2010 in Ames, Iowa.
I struggled with the number of features and with filtering them. I was extremely surprised that SalePrice did not vary with the year of sale. It is known that the US housing market went through a major recession in 2008 [1]. Thus, the dataset could be fictional.
However, with the data I looked at, Overall Quality of the house, the living area square footage and the neighborhood of the houses were major driving factors for the sale price. This resonates with all the main ingredients that a buyer would consider.
I didn’t create a linear model because the dataset has information on only 1460 houses and too many features to consider, many of which are correlated.
This dataset could be further explored for all the intercorrelations between the features (which I just touched upon) and how much they actually affect the Sale Price.
[1] https://en.wikipedia.org/wiki/Timeline_of_the_United_States_housing_bubble
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access to property
Grvl Gravel
Pave Paved
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
LotShape: General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
LandContour: Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
Utilities: Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
LotConfig: Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
LandSlope: Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
Neighborhood: Physical locations within Ames city limits
Blmngtn Bloomington Heights
Blueste Bluestem
BrDale Briardale
BrkSide Brookside
ClearCr Clear Creek
CollgCr College Creek
Crawfor Crawford
Edwards Edwards
Gilbert Gilbert
IDOTRR Iowa DOT and Rail Road
MeadowV Meadow Village
Mitchel Mitchell
NAmes North Ames
NoRidge Northridge
NPkVill Northpark Villa
NridgHt Northridge Heights
NWAmes Northwest Ames
OldTown Old Town
SWISU South & West of Iowa State University
Sawyer Sawyer
SawyerW Sawyer West
Somerst Somerset
StoneBr Stone Brook
Timber Timberland
Veenker Veenker
Condition1: Proximity to various conditions
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
Condition2: Proximity to various conditions (if more than one is present)
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
BldgType: Type of dwelling
1Fam Single-family Detached
2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit
HouseStyle: Style of dwelling
1Story One story
1.5Fin One and one-half story: 2nd level finished
1.5Unf One and one-half story: 2nd level unfinished
2Story Two story
2.5Fin Two and one-half story: 2nd level finished
2.5Unf Two and one-half story: 2nd level unfinished
SFoyer Split Foyer
SLvl Split Level
OverallQual: Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
OverallCond: Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
YearBuilt: Original construction date
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
RoofStyle: Type of roof
Flat Flat
Gable Gable
Gambrel Gambrel (Barn)
Hip Hip
Mansard Mansard
Shed Shed
RoofMatl: Roof material
ClyTile Clay or Tile
CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles
Exterior1st: Exterior covering on house
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Exterior2nd: Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
MasVnrType: Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
MasVnrArea: Masonry veneer area in square feet
ExterQual: Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
ExterCond: Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Foundation: Type of foundation
BrkTil Brick & Tile
CBlock Cinder Block
PConc Poured Concrete
Slab Slab
Stone Stone
Wood Wood
BsmtQual: Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches)
NA No Basement
BsmtCond: Evaluates the general condition of the basement
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
BsmtExposure: Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Minimum Exposure
No No Exposure
NA No Basement
BsmtFinType1: Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Rating of basement finished area (if multiple types)
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
Floor Floor Furnace
GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace
HeatingQC: Heating quality and condition
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
CentralAir: Central air conditioning
N No
Y Yes
Electrical: Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
KitchenAbvGr: Kitchens above grade
KitchenQual: Kitchen quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality (Assume typical unless deductions are warranted)
Typ Typical Functionality
Min1 Minor Deductions 1
Min2 Minor Deductions 2
Mod Moderate Deductions
Maj1 Major Deductions 1
Maj2 Major Deductions 2
Sev Severely Damaged
Sal Salvage only
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
GarageType: Garage location
2Types More than one type of garage
Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
GarageCond: Garage condition
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
PavedDrive: Paved driveway
Y Paved
P Partial Pavement
N Dirt/Gravel
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
Fence: Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
MiscFeature: Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold (MM)
YrSold: Year Sold (YYYY)
SaleType: Type of sale
WD Warranty Deed - Conventional
CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
COD Court Officer Deed/Estate
Con Contract 15% Down payment regular terms
ConLw Contract Low Down payment and low interest
ConLI Contract Low Interest
ConLD Contract Low Down
Oth Other
SaleCondition: Condition of sale
Normal Normal Sale
Abnorml Abnormal Sale - trade, foreclosure, short sale
AdjLand Adjoining Land Purchase
Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
Family Sale between family members
Partial Home was not completed when last assessed (associated with New Homes)